ABSTRACT

In some testing programs an early item analysis is
performed before final scoring in order to validate the intended
keys. As a result, some items which are flawed and do not
discriminate well may be keyed so as to give credit to examinees
no matter which answer was chosen. This is referred to as allkeying.
This research examined how varying the number of allkeyed items
affects the equating function and resulting equated scores. The
experimental conditions consisted of allkeying 0, 4, 10, and 25
items. The examination was a 200-item multiple-choice licensing
examination. Over 3,500 examinee records were studied. The results
showed virtually no differences in scaled score means across the
experimental conditions. Although the equating procedures compensated
for the changes that occurred as more items were allkeyed, the effect
of allkeying on an individual's scaled score will depend on the
individual's performance on the allkeyed items. The results suggest
that an item should not be allkeyed unless it is clear that there is
no defensible answer among the options. (Author/GDC)
ABSTRACT

In some testing programs an "early item analysis" is performed before final
scoring in order to validate the intended keys. As a result, some items may
be keyed so as to give credit to examinees no matter which answer was
chosen. (This is referred to as allkeying in this paper.) The purpose of
this research is to examine how varying the number of allkeyed items affects
the equating function and resulting equated scores. The experimental
conditions consisted of allkeying zero, four, ten, and twenty-five items. The
results showed virtually no differences in scaled score means across the
experimental conditions. Although the equating procedures compensated for the
changes that occurred as more items were allkeyed, the effect of allkeying on
an individual's scaled score will depend on the individual's performance on
the allkeyed items. The results suggest that an item should not be allkeyed
unless it is clear that there is no defensible answer among the options.
INTRODUCTION

In many standardized testing programs, an "early item analysis" is
performed before final scoring in order to ensure the quality and fairness of
the items in the examination. The purpose of this item analysis is to identify
items that performed poorly, as indicated by indices of difficulty and
discrimination. If a review of the item content reveals that an item is flawed,
the item may be scored all options correct (hereafter referred to as
allkeying). The number of allkeyed items varies from one test form to another.

Although the practice of allkeying items prior to equating and final
scoring is not uncommon in standardized testing, the effects of allkeying on
equating functions and equated scores have received little attention in the
literature. Dorans (1983) examined the effect of deleting an item (i.e.,
scoring an item either all options correct or no options correct) on
equating/scaling functions when IRT equating procedures were used. He found
that the effect of deleting an item was dependent on the characteristics of
the deleted item (i.e., difficulty, discrimination, and lower asymptote of
the item characteristic curve) and the scoring method used (i.e., no options
vs. all options correct). Dorans also found that when a flawed item was
discovered after the equating process was completed, the change in scaled
scores was much smaller when a new equating function was determined than when
the item was simply rescored either no options correct or all options correct.

Dorans was concerned with the effect on IRT true score equating of
allkeying (deleting) a single item that was identified as flawed only after
equating and final scoring had occurred. The present study investigates the
effect on linear equating of allkeying several items that have been identified
as flawed, based on statistical indices, before equating and final scoring.
The purpose is to examine how varying the number of allkeyed items affects the
equating function and individual scaled scores.

METHOD

The data in this study are from a nationally administered licensure
examination administered to more than 18,000 candidates. A spaced sample of
3,588 examinee records was selected for this study. The examination is
composed of 200 multiple-choice items. Each item is classified into one of
six content areas. Two of the content areas each contain 20 percent of the
total items while the other four content areas each contain 15 percent of the
total items. Forty of the items were chosen from a previous test form and
constitute an internal anchor that is used to equate scores on the current
form to a standard score scale. The equators were chosen to be both
statistically and content-representative of the complete form from which they
were chosen.

In this study, the experimental conditions consisted of allkeying zero,
four, ten, or twenty-five items. Although scoring twenty-five items all
options correct rarely occurs in practice, this condition was included for
theoretical interest. The specific items chosen for the four allkeying
conditions were among the items flagged during the early item analysis as
statistically questionable. None of the allkeyed items was an equator. In
this study, flawed items were scored all options correct because that is
standard practice on this licensure exam. Another option would be to score no
options correct. If raw scores are equated, the choice of scoring method is
arbitrary. Items scored all options or no options correct have no
differential psychometric impact on examinee scores. Either scoring method
effectively deletes the items from the test, and the results, after equating,
are identical.
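
The equivalence of the two scoring rules can be checked with a small sketch. It assumes a simplified linear conversion that matches the mean and standard deviation of the raw scores to a fixed target scale, not the Tucker or Levine procedures used in this study; the scores, the number of flawed items, and the target mean and S.D. are hypothetical.

```python
from statistics import mean, pstdev

def linear_convert(raw_scores, target_mean, target_sd):
    """Map raw scores onto a scale with the given mean and S.D."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]

# Hypothetical scores on the items that are NOT flawed.
base_scores = [95, 110, 123, 131, 140, 152]
n_flawed = 10  # hypothetical number of flawed items

# "No options correct" adds nothing for the flawed items;
# "all options correct" adds one point per flawed item for everyone.
none_correct = base_scores
all_correct = [x + n_flawed for x in base_scores]

scaled_none = linear_convert(none_correct, target_mean=138.0, target_sd=16.0)
scaled_all = linear_convert(all_correct, target_mean=138.0, target_sd=16.0)

for a, b in zip(scaled_none, scaled_all):
    print(round(a, 3), round(b, 3))
```

Because the two raw-score sets differ by the same constant for every examinee, the conversion absorbs the shift and each pair of scaled scores agrees.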

A second consideration in choosing groups of items to be allkeyed was the
distribution of the items across content areas. When the equators were
selected for this exam, they were chosen to reflect the percentage of items
in each content area. However, the allkeying of items upsets the balance
between the equators and the full test. Klein and Jarjoura (1985) found that
anchors that were not representative of the test as a whole resulted in
inaccurate equating. Representativeness was defined in terms of the
distribution of items across content areas. In a representative anchor the
percentage of equating items in each content area reflects the percentage of
items in each content area for the full exam.

In view of this finding, an attempt was made to balance the allkeyed
items across the six content areas. However, this was not entirely possible
because some content areas had very few flagged items. The number of allkeyed
items in each content area for the four conditions is listed in Table 1.
Because complete balancing was not possible, a small degree of
nonrepresentativeness was introduced. In general, the items that are allkeyed
as a result of an early item analysis are distributed across content areas and
thus introduce only slight nonrepresentativeness. However, there are other
rekeying situations that could have a more serious effect on the
representativeness of equators. For example, if a group of items from a
single content area (e.g., a multi-item set) was allkeyed, the balance between
the percentage of equators and total items in that content area would be
upset. The findings of Klein and Jarjoura suggest that such an occurrence
could affect the accuracy of equating. In order to test this hypothesis, a
fifth condition was added in which all the allkeyed items were chosen from a
single content area. Two multi-item sets, one containing four items and the
other five items, were allkeyed. The nine allkeyed items represented 23
percent of the total number of items in that content area. Again, none of the
nine items was an equator. The statistical characteristics of the allkeyed
items in the single content area condition (SC) differed somewhat from the
other allkeyed items. These items were not originally flagged as
statistically questionable and therefore tended to have higher difficulty
values and higher indices of discrimination. Table 1 also presents the
average item difficulty and discrimination of the allkeyed items for each of
the five conditions.
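
The size of the imbalance created by the single content area condition can be worked out from the figures given above. The sketch below assumes the 200 items split 40/40/30/30/30/30 across the six areas (from the 20 and 15 percent shares), that the 40 equators split 8/8/6/6/6/6, and that the nine allkeyed items sit in area 1 and are treated as deleted; the variable names are illustrative only.

```python
# Item counts per content area, derived from the 20/20/15/15/15/15 percent shares.
total_items = [40, 40, 30, 30, 30, 30]
# Assumed split of the 40 equators in the same proportions.
equators = [8, 8, 6, 6, 6, 6]

def percentages(counts):
    """Share of each content area, rounded to a whole percent."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]

# SC condition: nine items allkeyed (effectively deleted) in one 40-item area.
effective_items = total_items.copy()
effective_items[0] -= 9

print("allkeyed share of that area:", 100 * 9 / total_items[0])  # 22.5 percent
print("equators:      ", percentages(equators))
print("effective test:", percentages(effective_items))
```

Under these assumptions the equators remain at 20 percent in area 1 while the effective test drops to about 16 percent there, a mismatch of roughly four percentage points.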

All 3,588 records were scored under each of the five allkeying
conditions. Following rescoring, equating functions were derived for each
condition using two linear equating procedures: the Tucker method and the
Levine equally reliable method. These methods were chosen because they are
the methods generally employed in equating the examination used in this
study. After the equating functions had been derived, basic summary
statistics were obtained as well as raw and equated scores at specific points
on the score scale.
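
The formulas behind the Tucker method are not given in this paper. As a rough illustration, the sketch below follows the form in which the method is usually presented in the equating literature for a common-item design: regress total score on anchor score in each group, adjust means and variances to a synthetic population, and then set a linear conversion. The function names, the sample-size weights, and the data are assumptions made for the example, not the operational procedure for this examination.

```python
from statistics import mean, pvariance

def covariance(a, b):
    """Population covariance of two equal-length score lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def tucker_linear(x, v1, y, v2):
    """Return (slope, intercept) converting new-form scores X to the old-form Y scale.

    x, v1: total and anchor scores for the group that took the new form.
    y, v2: total and anchor scores for the group that took the old form.
    """
    n1, n2 = len(x), len(y)
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)  # assumed synthetic-population weights
    g1 = covariance(x, v1) / pvariance(v1)   # slope of X on V in group 1
    g2 = covariance(y, v2) / pvariance(v2)   # slope of Y on V in group 2
    dmu = mean(v1) - mean(v2)
    dvar = pvariance(v1) - pvariance(v2)
    mu_x = mean(x) - w2 * g1 * dmu
    mu_y = mean(y) + w1 * g2 * dmu
    var_x = pvariance(x) - w2 * g1**2 * dvar + w1 * w2 * g1**2 * dmu**2
    var_y = pvariance(y) + w1 * g2**2 * dvar + w1 * w2 * g2**2 * dmu**2
    slope = (var_y / var_x) ** 0.5
    return slope, mu_y - slope * mu_x

# Hypothetical data: the old form is five points easier for an equally able group.
new_total = [120, 131, 140, 152, 118, 160]
new_anchor = [22, 25, 28, 31, 21, 34]
old_total = [t + 5 for t in new_total]
old_anchor = list(new_anchor)

slope, intercept = tucker_linear(new_total, new_anchor, old_total, old_anchor)
print(round(slope, 3), round(intercept, 3))
```

With identical anchor performance in the two groups, the conversion reduces to adding the five-point difference in form difficulty, which is the behavior expected of any linear equating method in this situation.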

RESULTS 

The raw score means and standard deviations, equated score means and
standard deviations, and slopes and intercepts for the two linear equating
procedures and five allkeying conditions are shown in Table 2. As expected,
raw score means increased as greater numbers of items were allkeyed. For the
two extreme conditions of zero and 25 allkeyed items, the raw score means were
123.537 and 140.108, respectively. However, for both methods of equating,
equated score means remained virtually unchanged across the five conditions.
The largest difference in scaled score means was less than .03. It is clear
that, with regard to mean scores, the equating procedures were successful in
compensating for the changes in difficulty that occurred as more items were
allkeyed.

The compensatory effect is more obvious if a comparison is made of the
scaled scores that would be obtained for the same raw score in two different
allkeying conditions. A raw score of 130 converts to a scaled score of 144
(using the Tucker method) if no items are allkeyed. When ten items are
allkeyed, a raw score of 130 converts to a scaled score of 138. This
difference occurs because the latter test is easier. To receive a scaled
score of 144 on the test in which ten items were allkeyed, an examinee would
need to obtain a raw score of 137.
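
These conversions can be reproduced from the Tucker slopes and intercepts in Table 2. The sketch below assumes scaled and raw scores are reported to the nearest whole number; the function names are illustrative.

```python
# Tucker (slope, intercept) for the 0- and 10-item conditions, from Table 2.
TUCKER = {0: (0.941, 21.664), 10: (0.946, 14.844)}

def to_scaled(raw, n_allkeyed):
    """Linear conversion from raw score to scaled score."""
    slope, intercept = TUCKER[n_allkeyed]
    return slope * raw + intercept

def to_raw(scaled, n_allkeyed):
    """Inverse conversion: raw score that corresponds to a given scaled score."""
    slope, intercept = TUCKER[n_allkeyed]
    return (scaled - intercept) / slope

print(round(to_scaled(130, 0)))   # no items allkeyed
print(round(to_scaled(130, 10)))  # ten items allkeyed
print(round(to_raw(144, 10)))     # raw score needed for 144 with ten allkeyed
```

The three printed values correspond to the 144, 138, and 137 quoted in the paragraph above.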

Although the equating procedures do compensate, on the average, for
changes introduced by allkeyed items, it is informative to look at the effect
of allkeying on individual examinees. Table 3 shows the effect of allkeying
items on the equated scores of two hypothetical examinees. It was assumed
that both examinees obtained raw scores of 120 when no items were allkeyed.
It was further assumed that examinee A chose the original key on every item
that was later allkeyed, while examinee B chose a response other than the
original key on the allkeyed items. The table shows that both examinees would
obtain an equated score of 135 if no items were allkeyed. At the extreme of
twenty-five allkeyed items, examinee A would receive an equated score of 118
rather than the original 135, while examinee B would receive an equated score
of 143. This outcome is appropriate if the items were allkeyed because there
was no correct response. In that case, an examinee who chose the original
"correct" key would deserve no more credit than an examinee who chose an
"incorrect" response. However, if an item were allkeyed because it did not
work statistically, i.e., had very low indices of difficulty and
discrimination, but still had a justifiably correct answer, the result would
be to penalize those examinees who knew that answer and reward those who did
not.
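
The opposite movement of the two hypothetical examinees follows directly from the conversion parameters. The sketch below uses the Tucker slopes and intercepts from Table 2 for the 0-, 4-, and 10-item conditions only and assumes scaled scores are rounded to whole numbers; the names are illustrative.

```python
# Tucker (slope, intercept) by number of allkeyed items, from Table 2.
TUCKER = {0: (0.941, 21.664), 4: (0.940, 19.422), 10: (0.946, 14.844)}

BASE_RAW = 120  # raw score of both examinees when no items are allkeyed

def scaled(raw, n_allkeyed):
    slope, intercept = TUCKER[n_allkeyed]
    return round(slope * raw + intercept)

# Examinee A already had every allkeyed item right, so A's raw score never changes.
examinee_a = {k: scaled(BASE_RAW, k) for k in TUCKER}
# Examinee B missed every allkeyed item, so B gains one raw point per allkeyed item.
examinee_b = {k: scaled(BASE_RAW + k, k) for k in TUCKER}

for k in TUCKER:
    print(k, examinee_a[k], examinee_b[k])
```

A's raw score stays fixed while the conversion becomes less generous, so A's scaled score falls; B's raw gain outpaces the change in the conversion, so B's scaled score rises.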

A comparison of the zero allkeying condition and the single content area
condition in Table 2 shows very little difference in scaled score means and
standard deviations. The scaled score means for these two conditions differed
by less than .01.

DISCUSSION 

The purpose of this study was to assess the effect of allkeying items on
the linear equating function and individual scaled scores. The results showed
that even in the extreme condition in which 25 items were allkeyed, the linear
equating procedures are sufficiently robust to compensate for the changes
introduced by allkeying. Although from a practical point of view it is
encouraging to find that equating "works," it is somewhat surprising to find
no effect even for the most extreme condition. One reason that allkeying
these items may have had little effect on equating is that these items were
contributing little to the test in the first place. Some support for this
hypothesis can be found by examining the summary statistics in Table 1 and the
KR-20 reliability coefficients for the examination under the four original
allkeying conditions. The summary statistics show that the mean point
biserial for the allkeyed items in all four conditions is less than .10.
These items clearly do not discriminate between the good and poor examinees.
The KR-20 values for the 0-, 4-, 10-, and 25-item allkeying conditions were .865,
.867, .869, and .870, respectively. Although the test effectively gets
shorter as more items are scored all options correct, the reliability
coefficients increase. This increase may indicate that the allkeyed items
were measuring something different than the other items and thus introduced
noise into the measurement process. In any case, the change in reliability
from 0 allkeyed items to 25 allkeyed items is quite small, only .005.
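
How allkeying a poorly functioning item can raise KR-20 is easy to see on a toy data set. The sketch below uses the standard KR-20 formula with population variance and keeps the item count k at its full value after allkeying, which is an assumption, since the paper does not say how k was handled; the response matrix is hypothetical.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for a list of examinee rows of 0/1 item scores."""
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(row) for row in responses]
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion correct
        sum_pq += p * (1 - p)
    return n_items / (n_items - 1) * (1 - sum_pq / pvariance(totals))

def allkey(responses, item):
    """Score one item correct for every examinee."""
    return [row[:item] + [1] + row[item + 1:] for row in responses]

# Hypothetical data: the last item does not track the other three.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

before = kr20(responses)
after = kr20(allkey(responses, 3))
print(round(before, 3), round(after, 3))
```

In this toy case the coefficient rises once the discordant item is allkeyed, because the item no longer contributes item variance or noise to the total scores. The change is far larger than the .005 observed in the study only because the example has four items rather than 200.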

Although the allkeying of items had little effect on the mean scaled
scores, it would be a mistake to conclude that allkeying had no effect. As
the results in Table 3 show, the effect of allkeying items on an individual's
scaled score depends on the individual's original response choice. Allkeying
results in a decrease in scaled scores (relative to no allkeys) for
individuals who chose the original "correct" response and in an increase in
scaled scores for individuals who chose a response other than the key. The
decrease in scaled scores for individuals choosing the "correct" response
can only be justified if the allkeyed items truly have no correct
answer and the original key is no more correct than any of the distractors. A
decision to allkey should be based on a consideration of the item content, not
solely on the item statistics.

The failure to find an effect of allkeying on scaled score means in the
single content area allkeying condition is inconsistent with the Klein and
Jarjoura study. Klein and Jarjoura found that non-representative anchors
resulted in inaccurate equating. A partial explanation for this contradiction
may be found in the degree of non-representativeness of the anchor forms. The
percentages of equators for the SC condition in this study and for Klein and
Jarjoura's non-representative anchors are shown in Table 4 along with the
percentage of items in the total exams used in each study. An examination of
the table shows that there is a poorer match between percentages of equators
and total items in both of the anchors used in the Klein and Jarjoura study
than in the current study. It seems likely that the degree of
non-representativeness of the anchor in the current study was not great enough to
affect the equating process. It should also be noted that Klein and Jarjoura
manipulated the match by varying the number of equators chosen from each
content area, while in the current study the lack of match was due to the
allkeying of non-equating items. Although both manipulations result in a lack
of match between percentages of equators and total test items, the effect on
the equating process may not be the same.

In summary, this research examined the effect on linear equating of
allkeying test items in a national standardized licensure testing program.
The results showed that mean scaled scores remained virtually unchanged over
all allkeying conditions examined. The linear equating procedures were
sufficiently robust to withstand the violations of equating assumptions
introduced by the manipulations in this study. However, the allkeying of
flawed items can affect individual scaled scores and should be considered only
after an analysis of the item content has revealed that no justifiably correct
response appears among the options.
Table 1

Number of Allkeyed Items Across
Content Areas and Item Summary Statistics
Table 2

Raw and Scaled Score Means and Standard Deviations
and Conversion Parameters for Five Allkeying
Conditions and Two Methods of Linear Equating

                                          Number of Allkeyed Items
                                0          4          10         25            SC(9)

         Raw Score Mean         123.537    126.079    130.069    140.103       126.642
         Raw Score S.D.          16.897     16.918     16.804     16.172        16.236

Tucker   Scaled Score Mean      137.878    137.876    137.871    137.860       137.873
Method   Scaled Score S.D.       15.895     15.895     15.894    [illegible]    15.895
         Slope                     .941       .940       .946    [illegible]      .979
         Intercept               21.664     19.422     14.844       .166        13.893

Levine   Scaled Score Mean      136.687    136.690    136.696    136.709       136.694
Method   Scaled Score S.D.       15.847     15.848     15.849     15.852        15.849
         Slope                     .938       .937       .941       .980          .977
         Intercept               20.823     18.588     14.020      -.627        13.072
Table 3

Raw and Scaled Scores for Two Hypothetical
Examinees Across Four Levels of Rekeying

                              Number of Multiple Keys
                    0               4               10              25
               Raw   Scaled    Raw   Scaled    Raw   Scaled    Raw   Scaled
               Score Score     Score Score     Score Score     Score Score

Examinee A     120   135       120   132       120   128       120   118
Examinee B     120   135       124   136       130   138       145   143
Table 4

Distribution of Equators and Total Items Across Content Areas

                                                    Content Areas
                                               1    2    3    4    5    6

Current    % of Equators (SC condition)       20   20   15   15   15   15
Study      % of Total Test Items              16   21   16   16   16   16

Klein      % of Equators (Anchor 1)
and        % of Equators (Anchor 2)
Jarjoura   % of Total Test Items

[Klein and Jarjoura cell positions not recoverable; values in source order:
24 18 17 12 22 17 16 15 20 10 20 15 15 75]






REFERENCES 



Dorans, N. J. (1983). Effects on score distributions of deleting an unkeyable item
from a test (Research Rep. No. 83-^5). Princeton, NJ: ETS.

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for
common-item equating with nonrandom groups. Journal of Educational Measurement,
22, 197-206.






